
Support Whisper training with Google Cloud buckets#70

Merged
huwenjie333 merged 18 commits into main from whisper_gcp
Apr 13, 2026
Conversation

@huwenjie333
Contributor

@huwenjie333 huwenjie333 commented Mar 6, 2026

This PR makes the following changes to the salt library to support the latest Whisper finetuning:

  • during dataset loading:
    • add support for Google Cloud buckets.
    • add support for limiting the number of examples per dataset.
    • skip the complicated source-target matching logic for ASR models via the skip_matching_asr argument.
    • iterate over validation datasets with multiple threads as well, while keeping the same order.
  • update SALT_LANGUAGE_TOKENS_WHISPER in constants.py with 51 African languages for training the new whisper-salt ASR model.
  • optimize the speed of multilingual_eval_fn in metrics.py by skipping an unnecessary, CPU-heavy audio decoding step.
  • fix a bug in the augment_audio_noise function in preprocessing.py that caused the output audio to be zero-sized.

The Whisper finetuning/training scripts and configs have been moved to the sunbird-speech repo, under the speech-to-text/whisper directory.
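The order-preserving multithreaded iteration over validation datasets can be sketched with the standard library. This is a minimal illustration, not the actual salt code; the dataset names and the `load_examples` helper are hypothetical.

```python
# Minimal sketch of order-preserving multithreaded iteration.
# ThreadPoolExecutor.map runs the loads concurrently but yields
# results in input order, which keeps the dataset order stable.
from concurrent.futures import ThreadPoolExecutor

def load_examples(dataset_id):
    # Stand-in for per-dataset loading work (I/O bound in practice).
    return [f"{dataset_id}-ex{i}" for i in range(3)]

dataset_ids = ["val_a", "val_b", "val_c"]
with ThreadPoolExecutor(max_workers=4) as pool:
    per_dataset = list(pool.map(load_examples, dataset_ids))

# Flatten while preserving the original dataset order.
examples = [ex for batch in per_dataset for ex in batch]
```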


[Deprecated]

This PR adds support for Google Cloud buckets to the Whisper training pipeline, along with several other changes:

  • load the Parquet datasets from a gcs:// path with datasets.load_dataset and cast the audio column to the datasets.Audio format.
  • create a setup shell script that installs the dependencies and configures the Google Cloud credentials.
  • load modules such as salt.datasets from the current repo instead of https://github.com/jqug/salt.git
  • move the YAML config from the training notebook to a separate file
  • fix several training errors with the following changes:
    • use BF16 instead of FP16 for training
    • set gradient_checkpointing=False
    • add torch_dtype=torch.float32 when loading the model weights
    • update model.generation_config based on the requirements of the new version.
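In YAML form, the training fixes above might look like the fragment below. The key names are typical Hugging Face TrainingArguments names and are illustrative only, since the actual config file lives in the sunbird-speech repo.

```yaml
# Hypothetical config fragment (key names assumed, not copied
# from the actual file in sunbird-speech).
training_args:
  bf16: true                    # BF16 instead of FP16
  fp16: false
  gradient_checkpointing: false
model:
  torch_dtype: float32          # dtype used when loading the weights
```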

Overfit experiment

An overfit experiment with just 100 examples was run to verify the changes:

  • MLflow run 1 with evaluation metrics: https://mlflow-sunbird-ce0ecfc14244.herokuapp.com/#/experiments/0/runs/2d488acdc39146e9af9da07c00128d49/model-metrics
  • MLflow run 2 with GPU utilization: https://mlflow.sunbird.ai/#/experiments/0/runs/811bbdf051f44597bd90c3376cfc9309/system-metrics


TODO

  • we need to update the salt.constants.SALT_LANGUAGE_TOKENS_WHISPER to support new languages. Currently we only have the following:
```python
SALT_LANGUAGE_TOKENS_WHISPER = {
    # Exact/close mapping
    'eng': 50259,
    'swa': 50318,
    # Overwrite unused language tokens
    'ach': 50357,
    'lgg': 50356,
    'lug': 50355,
    'nyn': 50354,
    'teo': 50353,
    'xog': 50352,
    'ttj': 50351,
    'kin': 50350,
    'myx': 50349,
    'kik': 50348,
}
```
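As a sanity check on the "exact/close mapping" entries (a sketch, not part of the PR): in the multilingual Whisper tokenizer the language tokens are contiguous, starting at `<|en|>` = 50259 and following the order of Whisper's language table, in which Swahili ("sw") sits at index 59.

```python
# Language tokens in multilingual Whisper are contiguous, starting at
# <|en|> = 50259 and following the tokenizer's language order.
EN_TOKEN_ID = 50259
SWAHILI_INDEX = 59  # position of "sw" in whisper.tokenizer.LANGUAGES

swa_token_id = EN_TOKEN_ID + SWAHILI_INDEX  # should match 'swa' above
```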
  • currently each evaluation step takes 3–4 minutes; I'm not sure whether this is expected.

@review-notebook-app

Check out this pull request on ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.



@huwenjie333 huwenjie333 changed the title [WIP] Run Whisper training with Google Cloud buckets Run Whisper training with Google Cloud buckets Mar 6, 2026
@huwenjie333 huwenjie333 requested review from ak3ra, evie-8 and jqug March 6, 2026 12:02
Contributor

@jqug jqug left a comment


Thanks for this, looks good.
Just one thing, let's take out the gcloud auth for now and maybe mention in a comment in the file that this may be necessary.

Comment thread: whisper_training_setup.sh (outdated)
@ak3ra
Contributor

ak3ra commented Mar 9, 2026

We should consider merging this notebook into the dedicated sunbird-speech repo. I have been working on refactoring it here: https://github.com/SunbirdAI/sunbird-speech

Comment thread: notebooks/training/configs/whisper_finetuning_gcs.yaml (outdated)
@huwenjie333 huwenjie333 changed the title Run Whisper training with Google Cloud buckets Support Whisper training with Google Cloud buckets Apr 6, 2026
@huwenjie333 huwenjie333 requested a review from jqug April 8, 2026 03:23
Contributor

@jqug jqug left a comment


Thanks, LGTM

I double checked the language token IDs comparing with the Whisper tokenizer, and they look right. Actually I didn't realise that Whisper supports so many African languages already :)

Comment thread: dataset.py

```python
            "target": row.get("text"),
            "target.language": row.get("language"),
        }
        yield example
```
Contributor


This looks good. A further improvement for later, in case it's an ASR/audio dataset and the format already matches, is not to use a generator at all - we just load the huggingface datasets and concatenate them. That should reduce CPU bottleneck and could improve GPU utilisation.

@huwenjie333 huwenjie333 merged commit d3d1ead into main Apr 13, 2026
1 check passed
